\section{Key Research Questions and Objectives}
\label{objetivos}

The key goal of the current research proposal is to develop a system capable of understanding instructions given in natural language (e.g., English) that are situated in a physical or virtual environment. The system we envisage will not only interpret the instructions but also carry them out in real time. Situated interpretation of instructions is a challenging problem for two main reasons. First, natural languages are inherently rich: the same instruction can be phrased in infinitely many ways. Second, situated language is elliptical with respect not only to previous discourse but also to salient opportunities afforded by the situation: if there is only one salient thing to do next, an instruction like ``do it'' or simply ``yes'' is likely to be enough to get a person to do the action. 

We will tackle the problem of interpreting an utterance $u$ in a situation $s$ by casting it as a probabilistic classification problem: of all the reactions \emph{affordable} in $s$, which one has the highest probability of being a reaction to $u$? This approach raises the following questions:
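This classification view can be stated compactly. Writing $A(s)$ for the set of reactions affordable in situation $s$ (notation introduced here only for illustration), the interpretation task is to find

\begin{equation*}
\hat{a} \;=\; \mathop{\mathrm{arg\,max}}_{a \,\in\, A(s)} \; P(a \mid u, s),
\end{equation*}

i.e., the affordable reaction $\hat{a}$ that maximizes the probability of being a reaction to the utterance $u$ in the situation $s$.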

\begin{description}
\item[Situation Modeling] How can we calculate the actions that are affordable (i.e., possible) in a given situation? At what level of granularity should we model these actions? How can we model the saliency of the affordable actions by observing human-human interactions in the domain?
\item[Classification] How can we rank the affordable actions according to their saliency and their probability of being a reaction to a given utterance? What are the relevant features (from the utterance and the situation) for training the classifier on human-human interactions?
\item[Management of misunderstandings] In case of a misunderstanding, how can the system update the probability distribution over its interpretations based on the correction it receives?
\item[User Modeling] What are the relevant features from the interaction with a particular user that will help re-rank our potential interpretations? Does explicitly modelling the estimated common ground with the user~\cite{clark_1996} help the system understand that user better? Can predictive statistical models for user modeling~\cite{zukerman-albrecht:PSMFUM} improve the interpretation?
\end{description}
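The ranking and misunderstanding-management questions above can be illustrated with a minimal sketch. All function names, reactions, and scores below are hypothetical placeholders, not part of the proposed system: a real classifier would derive its scores from features of the utterance and the situation, whereas here they are fixed by hand.

```python
# Minimal sketch: rank affordable reactions by an estimate of
# P(reaction | utterance, situation), then renormalize the distribution
# after the user rejects the top interpretation. Scores are illustrative.
import math

def softmax(scores):
    """Turn raw classifier scores into a probability distribution."""
    m = max(scores.values())  # subtract max for numerical stability
    exp = {a: math.exp(s - m) for a, s in scores.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}

def rank(scores):
    """Affordable reactions sorted by decreasing probability."""
    probs = softmax(scores)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

def update_after_correction(probs, rejected):
    """Drop a rejected interpretation and renormalize the rest."""
    remaining = {a: p for a, p in probs.items() if a != rejected}
    z = sum(remaining.values())
    return {a: p / z for a, p in remaining.items()}

# Hypothetical scores for reactions affordable in situation s, given utterance u.
scores = {"open_door": 2.0, "pick_up_key": 1.0, "wait": -1.0}
ranking = rank(scores)
best = ranking[0][0]                       # top-ranked interpretation
probs = softmax(scores)
after = update_after_correction(probs, best)  # user said "no, not that"
```

The sketch only shows the shape of the computation: the interesting research questions are precisely how the scores are learned from human-human interactions and how richer corrections (beyond outright rejection) should reshape the distribution.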

In our current conception of how this research project should evolve, there are two pivotal points around which every other question should be considered: the use of {\em cheap data}, and the issue of {\em reversibility}.

Regarding the first point, we aim to build a system that learns automatically how to interpret instructions without requiring manual annotation (unlike typical statistical approaches, which require manual annotation in one form or another). Attaining this objective would allow us to exploit large corpora of instructions and interactions that currently cannot be used because of the sheer effort that annotating them would require.

The second point, and perhaps one of the most complex issues we plan to tackle, is {\em reversibility}: can we unify the internal model of an instruction interpretation system with that of an instruction generation system? Both systems require access to similar training corpora, and both take a similar approach to modeling the current environment. A shared probability model (that is, a model that predicts the outcome of an action, whether the system wants to anticipate how a user will react to an instruction it generates, or the system has received an instruction and needs to select the appropriate reaction) would be extremely useful, as it would be a first step towards a fully conversational system. Whether such a shared model can be obtained is a question we intend to answer in this research project.